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ABSTRACT 

Strong gravitational lensing of an extended object is described by a mapping from source to image 
coordinates that is nonlinear and cannot generally be inverted analytically. Determining the structure 
of the source intensity distribution also requires a description of the blurring effect due to a point 
spread function. This initial study uses an iterative gravitational lens modeling scheme based on 
the semilinear method to determine the linear parameters (source intensity profile) of a strongly 
lensed system. Our 'matrix-free' approach avoids construction of the lens and blurring operators 
while retaining the least squares formulation of the problem. The parameters of an analytical lens 
model are found through nonlinear optimization by an advanced genetic algorithm (GA) and particle 
swarm optimizer (PSO). These global optimization routines are designed to explore the parameter 
space thoroughly, mapping model degeneracies in detail. We develop a novel method that determines 
the L-curve for each solution automatically, which represents the trade-off between the image 
and regularization effects, and allows an estimate of the optimally regularized solution for each lens 
parameter set. In the final step of the optimization procedure, the lens model with the lowest is 
used while the global optimizer solves for the source intensity distribution directly. This allows us 
to accurately determine the number of degrees of freedom in the problem to facilitate comparison 
between lens models and enforce positivity on the source profile. In practice we find that the GA 
conducts a more thorough search of the parameter space than the PSO. 
Subject headings: gravitational lensing: strong — methods: numerical 



1. INTRODUCTION 

Strong gravitational lens effects produce multiple dis- 
torted images of a background object and also provide 
magnification of lensed sources. Magnification may re- 
veal unresolved features in lensed sources and provides a 
useful tool for studying cosmologically distant objects. 
Furthermore, gravitational lensing provides a unique 
method to determine the mass distribution of lensing 
objects, which can be most accurately modeled when 
the lens potential is probed in parallel at many points. 
Therefore accurate lens inversion methods for extended 
sources are important because they provide a large num- 
ber of constraints on the lens mass distribution. 

Models of both the intensity profile of the source and 
the lens mass distribution are required to model a strong 
gravitational lens system. Analytical models of the 
source are sometimes used because they are typically de- 
scribed by a small number of parameters, and can ensure 
smoothness and positivity when used to model the source 
intensity distribution. However, the correct parameteri- 
zation is not always clear for such models, and the choice 
of a specific parametric model biases the lens and source 
solutions. Authors have attempted to partially overcome 
this drawback by using complex but flexible parametric 
models specified by large parameter sets. The most ex- 
treme example is Tyson et al. (1998), who used an elab- 
orate source model with more than 200 parameters to 
fit the gravitational lens CL0024 -I- 1654. In such cases, 
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it may be simpler to use pixelized source models, which 
treat all pixels on the source independently. 

The difficulty with pixelized source models is that they 
require many more free parameters than even the most 
complex analytical models. The optimization of ex- 
tended sources via pixelized intensity distributions is sim- 
pli fied using the ver s atile s emilinear method developed 
by iWarren and Dvd (|2003[ ) and l ater expanded upon 
by a n umber of authors, including iTreu and KoopmansI 
(|2004[ ). The semilinear method uses a pixelized source, 
and also incorporates the blurring due to the point- 
spread function (PSF) of the instrument used to obtain 
the data. Additive noise in the observational data is also 
taken into account by the semilinear method. 

In this paper we detail Mirage, a gravitational lens 
modeling code written in MATLAB and C. The present 
version of Mirage is designed to optimize the parameters 
of analytical lens models and pixelized sources, but work 
is underway to extend the code to handle non-parametric 
lens models as well. A modified version of the semi- 
linear method forms the backbone of our lens model- 
ing program. We use sophisticated global optimization 
methods to fit the lens parameters, and the semilinear 
method to determine the corresponding source light pro- 
file that best match es the data. As a fi nal st ep, we em- 
ploy the method of iBrewer and Lewis! (|2005[ ) to enforce 
the positivity of the source while keeping the nonlinear 
lens parameters constant. This affords a method of com- 
parison between distinct lens density models because the 
number of degrees of freedom is well-defined and fixed 
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(|Brewer and Lewiil2006D . The global optimizers studied 
in this paper consist of a sophisticated genetic algorithm 
(GA), called Ferret (jFiege et al.ll2004D . and an enhanced 
particle swarm optimizer (PSO), Locust, which are both 
components of the Qubist Global Optimization Toolbox 
by iFieg e (2010). This paper discusses a robust method 
for gravitational lens reconstructions, highlights the ben- 
efits of both types of optimization routines, and compares 
their performance. 

In Section [2] we will review the gravitational lens 
inverse problem, the semilinear method and our new 
matrix-free approach to lens modeling. In Section [3] we 
discuss the details of the GA and PSO, as well as a variety 
of simulated data tests. Section|4]presents our results us- 
ing these methods, and our conclusions are summarized 
in Section [5l 

2. THE GRAVITATIONAL LENS PROBLEM 

In this section, we describe the background of the grav- 
itational lens problem. Section [2 . 1 1 reviews the semilinear 
method, and Section [2?2] details our matrix- free method. 
A small scale test of our algorithm is discussed in Section 
12.31 and the implicit regularizing properties of iterative 
methods are described in Section Determination of 
the effective number of degrees of freedom is discussed in 
Section 12.51 and the L-Curve criteria in Section 12.61 Fi- 
nally, the results of our algorithm applied to a large-scale 
test are shown in Section [2771 

We invoke the standard thin lens geometry (jRefsdall 
I1964D . tiling the source plane with coordinates /3=(^,77) 
and the observed image plane with coordinate system 
6={x,y) (|SchneideHll985D . The thin lens equation maps 
light rays from the image plane to the source plane, such 
that 

(3 = e-Q.{6). (1) 

Equation ([T]) is nonlinear because the deflection angle 
depends on the image coordinates, and multiple solu- 
tions to the lens equation may ex ist when it is solved 
(jSchneider. Ehlers and Falcol |1992|) . This multi- valued 
property makes the lens mapping non-invertible in gen- 
eral, and the lens equation can be solved analytically for 
only a handful of simple lens models. 

It is straightforward to find the gravitationally lensed 
image of a background source in the absence of blurring, 
given a model of the source and a lens mass distribu- 
tion. The defiection angle ct.{6) determines the position 
of image pixels back-traced to the source plane, and the 
brightness of each image pixel is determined by conser- 
vation of surface brightness: 

s{6) = nmi (2) 

where S(0) represents the intensity at point {x^y) on 
the image plane and I](/3(0)) is the corresponding back- 
traced intensity at position (£(0), ri{6)) on the source 
plane (jKavser and S chramm 1988). In principle, it is 
also possible to use Equation ([5]) as a simple method to 
calculate the intensity distribution of the source from 
an image. However, methods based on the conserva- 
tion of surface brightness are complicated by the fact 
that many back-traced image pixels may land within 
any given source pixel due to the multiple imaging 



property of the lens mapping. When this occurs, the 
source pixel is assigned the mean intensity of the back- 
traced image points. This approach to lens model- 
ing is cal led the Digital Source Recon struction (DiSoR) 
method (jKavser and Schramml |1988l) , and derives the 
structure of a lensed source based on the observed pixel 
intensities given a lens model. The advantage of this 
approach is that the conservation of surface brightness 
(Equation ([2])) avoids solving the lens equation directly, 
since this would in general involve finding the solutions 
to a complicated nonlin ear equatio n at the p osition of 
each source pixel (jSchramm and Kavsei] ll9871. 

Despite their tempting simplicity, lens inversion 
schemes based on the DiSoR method cannot be used for 
most applications because they fail in the presence of sig- 
nificant distortion due to instrumental and atmospheric 
blurring described by a PSF. Readout and background 
noise further complicate the application of the conserva- 
tion of surface brightness by corrupting the pixel intensi- 
ties present in the data. These factors make it impossible 
to use the DiSoR approach to obtain the exact inversion 
of the lens system through ray-tracing, because multi- 
ply imaged pixels may no longer have identical intensi- 
ties, which invalidates the fundamental assumption of the 
method. When significant blurring or noise is present, a 
forward modeling approach is needed such that a source 
model is lensed using Equation ([2]) and convolved with 
the PSF for comparison with the data. The semilinear 
method is based on forward modeling and reduces the 
lens inversion to a least-squares type of problem. 

2.1. The Semilinear Method 

The semilinear method provides a way to solve for op- 
timal source intensities by the direct inversion of a lens 
matrix, for a given set of lens parameters. However, the 
search for optimal lens parameters is nonlinear in general, 
and requires more sophisticated nonlinear optimization 
methods, such as those discussed in Section 13.11 Fast 
execution of the matrix inversion part of the problem 
is crucial, because a linear system of equations must be 
solved for each set of lens parameters tested by the non- 
linear optimizer during the search for optimal lens mod- 
els. Many sets of lens parameters must be evaluated to 
search the parameter space thoroughly enough to deter- 
mine the gl obally optimal solution. 

Following I Warren and Dvd ()2003D . we label the image 
pixels j = 1..J and treat the pixels in the source as in- 
dependent free parameters i = Given a set of lens 
parameters, the image of each source pixel is formed by 
ray-tracing assuming unit brightness Si = \, and con- 
volved with the PSF. This transformation is encoded in 
a matrix f=BL. We assume linear blurring described 
by the blurring matrix B which accounts for the PSF. 
The matrix L performs ray-tracing via the lens equa- 
tion llj . The problem is then reduced to finding a set of 
source pixel scaling factors Si that minimize the reduced 
statistic between the model image and the observed 
data. Using the set of source pixel intensities, the lensed 
image of a source is found easily: 

bj^'^sjij, (3) 
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where fij are the elements of the matrix /. The 
statistic between the lensed image and the data is: 
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(4) 



where dj are the observed intensity in each image pixel, 
and (Tj is the standard deviation error associated with 
pixel j. After differentiating this equation with respect 
to the source pixel intensities Si, we define F = fij/<Jj, 
d = dj / (jj and we obtain the relation 



F^Fs 



(5) 



from which it follows that the source pixel scalings can 
be determined by linear inversion: 



s = M^d' 



(6) 



where d'=F^d and M=F'^F. This inversion deter- 
mines the optimal set of source pixel scalings neces- 
sary to reproduce the observed data for a given lens 
model. Further details of th is derivation can be found i n 
iWarren and Dvelp003D and lTreu and Koopmaili (|2004D . 
In the standard semilinear method, the system matrix M 
is very large, where the linear size of the matrix scales as 
the number of source pixels used in the inversion. The 
matrix is very sparse when the PSF is narrow, but a 
greater fraction of matrix elements are non-zero for in- 
creasingly broad PSFs. Mirage uses sparse matrix meth- 
o ds to minimize memory usage. 

IWarren and Dvd (|2003 D originally presented the semi- 
linear method as a ray-shooting alg orithm that performs 
neares t neighbor interpolation. iTreu and KoopmansI 
(j2004t ) modified the lens matrix to accommodate bilin- 
ear interpolation of the source. This consists of using 
the four source pixels surrounding a back-traced image 
pixel with appropriate weighting to assign a brightness 
value to each image pixel. Mirage currently implements 
nearest neighbor, bilinear, and bicubic source plane in- 
terpolation. Higher order interpolation schemes are pos- 
sible, but they are computationally more expensive and 
result in a lens mapping matrix that is less sparse. It is 
also possible to use more elaborate schemes to grid the 
source pla ne, including the Delaunay t esselation scheme 
used by Vege tti and Koopmanj ()2009[ ). We restrict the 
source models to rectilinear grids in this paper, but plan 
to explore other such options. 

In practice, regularization is necessary to stabilize the 
matrix inversion due to the presence of noise in the data 
(|Treu and KoopmansI 1200^ . This regularization term 
makes the system matrix M more diagonally dominant 
and hence better conditioned, which has the effect of in- 
creasing the smoothness of the source light distribution. 
In general, we add a regularization matrix to the system 
matrix, to give 



M' = M 



(7) 



where A is an adjustable regularization parameter. It is 
then possible to control the smoothness of the derived 
solution by adjusting the regularization parameter, with 



the unregularized case recovered as A — 0. Zeroth or- 
der regularization has H=I, which suppresses noise in 
the i nversion by prefer ring sources with less total inten- 
sity (jWarren and Dv j [2003) . It is also possible to in- 
troduce more complicated forms of regularization, typ- 
ically based on finite difference representations of two- 
dimensional derivative operators. 

It has been shown that different r egularizing terms pro- 
duce qualitatively similar results ()Treu and KoopmansI 
|2004( ). and the behavior of a host of linear regularization 
schem es has been studied in great detail by Suyu et al] 
(j2006( ) in the framework of Bayesian analysis. An im- 
portant drawback of regularization is that it introduces 
dependencies between source plane pixels, which makes 
it difficult to characterize the effective number of degrees 
of freedom required to compute the reduced iXr)- 
Therefore, direct comparison of different models is more 
diffic ult in regularized s cheme s than without regulariza- 
tion. iDve and WaTmll (|20Q5l ) use an adaptive mesh in 
the source plane to overcome the problem of calculating 
the number of degrees of freedom in the problem. 

An important advantage of the semilinear method is 
that errors of the source intensity parameters can be eas- 
ily determined from the lensing matrix, as seen from the 
relation 



2 d Sid Si 



(8) 



This equation expresses the lensing matrix as half of 
the Hessian matrix of th e reduced image statis- 
tic. IWarren and^y^ (pOOl use this relationship to find 
the covariance matrix C—M~^, thus determining the 
source plane errors automatically during application of 
the semilinear method. When regularization is used, 
the covariance matrix cannot be found in this way, but 
Warren and Dye (2003) proposed a Monte Carlo method 
as an alternative method to estimate errors. 

Despite its conceptual elegance, there are significant 
practical limitations and drawbacks to the semilinear 
method. The number of non-zero matrix elements of M 
scales linearly with the number of pixels in the source and 
depends on the source interpolation method used. Direct 
inversion quickly becomes impractical for very large im- 
ages, or those with large PSFs. This sparsity requirement 
can be fulfilled by representing the PSF by a simpler 
function, for example a Gaussian, and setting small val- 
ues to zero. This thresholding helps to control the poten- 
tially poor conditioning of the blurring matrix. However, 
realistic PSFs may contain significant low-level structure, 
and fitting a simple analytical function to it may not be 
desirable in such cases. The semilinear scheme is there- 
fore most practical when the image is small and the PSF 
is narrow, as in typical optical data. 

Another problem with the semilinear method is that it 
does not enforce the positivity of source pixel intensities. 
Optimal source solutions derived using this algorithm 
may contain negative pixel intensities, due to noise in 
the data, since bounds cannot be enforced in the matrix 
inversion step. Moreover, there is no form of linear reg- 
ularization that is guaranteed to prevent this behavior. 
We note, however, that it is possible to enforce positivity 
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in other lens modeling schemes, such as the maximum 
entrop y method (MEM), explored by iWallington et al] 

In summary, semilinear inversion provides a convenient 
method for modeling strongly lensed extended sources 
because it states the gravitational lens modeling problem 
using a least-squares approach that is solved by direct 
matrix inversion. The inversion step guarantees that the 
globally optimal solution is found for unbounded source 
pixel values. However, the method is computationally 
expensive for large images and PSFs. In such cases 
even building the transformation matrix /, incorporating 
both lensing and blurring effects, is an expensive compu- 
tation and the inversion step may be time-consuming or 
impractical due to the poor sparsity and size of the ma- 
trix. We show in Section [2^ that it is possible to derive 
a 'matrix- free' formulation that avoids the explicit con- 
struction of the matrix and dramatically improves the 
efficiency of solving for the linear parameters by employ- 
ing local optimization methods to solve the least squares 
problem. We also compare results from this technique 
with the semilinear inversion method and show that both 
methods produce solutions of similar quality. A final re- 
finement step ensures positivity of the source pixels, thus 
rectifying a limitation of the standard semilinear method. 

2.2. Matrix-free modeling of lensed images 

The main goals of Mirage are generality, flexibility 
and sufficient efficiency to allow thorough exploration of 
lens model parameter space by global optimization tech- 
niques, which typically require the evaluation of 10'* — 10"'' 
lens models. We therefore require a fast code to solve the 
linear part of the problem, which is able to function with 
arbitrarily complicated PSFs and data of high resolution. 
Mirage implements the semilinear method, using direct 
matrix inversion, but also extends this method by using 
faster and less memory intensive local iterative optimiza- 
tion routines that avoid the need for explicit construction 
of the lens and blurring operators. The iterative methods 
in Mirage are not intended as a replacement for the semi- 
linear method, which is the preferred approach when it 
is computationally practical. However, memory require- 
ments and long run times for the global lens parameter 
search may practically restrict the semilinear method to 
source intensity distributions and PSFs that can be built 
on a small mesh to limit the matrix size and maintain its 
sparsity. Our technique is intended to augment the least- 
squares approach of the semilinear method by providing 
a complement of algorithms capable of modeling large 
lens images quickly, even if the PSF is also large. We 
avoid the direct inversion of large matrices, but main- 
tain the least-squares formulation, allowing the use of 
any linear optimization algorithm suited to the solution 
of large-scale problems. Since this paper focuses on solv- 
ing the full nonlinear lens modeling problem, we largely 
make use of matrix- free methods for the remainder of this 
work except for comparisons with the direct semilinear 
method. 

Given the parameters of a lens model and assuming a 
source intensity distribution, matrix multiplication with 
L results in the unblurred lensed image. This image can 



also be found by the conservation of surface brightness, 
as given by Equation ([2]) . By storing the positions of the 
back-traced image pixels on the source plane, we perform 
an interpolation on the source plane directly, which al- 
lows us to find the unblurred lensed image without the 
need for an explicit representation of the lens matrix. 
Similarly, a separate algorithm in Mirage has an effect 
equivalent to multiplying by the transpose of the lens 
matrix, which works by carefully keeping track of the 
positions of back-traced image pixels. 

The lens mapping magnifies portions of the image 
plane by differing amounts such that square image pix- 
els mapped back to the source plane may no longer re- 
main square. This effect is especially pronounced when 
image pixels are traced back to the source plane near 
caustic curves. In general, the distortion in shape of the 
back-traced image pixels may cause portions of the back- 
traced pixels to lie within separate source pixels. To ac- 
count for this effect, we split each image pixel into Np 
subpixels, and trace each of these subpixels to the source 
plane independently. By interpolating each of these sub- 
pixels on the source mesh, we can then average over their 
intensities in the image plane to find a better estimate of 
the lensed intensity profile. Np can be any size, but the 
execution time of the code increases as we include finer 
subpixel resolution. This approach improves the accu- 
racy of the transformation from the lens to source plane. 
In addition, we find that it improves the smoothness of 
the surface, which helps with the global search for op- 
timal lens parameters. Image pixels that do not map to 
the source plane are not included in the local optimiza- 
tion and are assigned no intensity. 

To successfully model a realistic blurred observation, 
we incorporate the blurring effect of the PSF, which is 
usually described by the blurring matrix B. This ma- 
trix is of size Nf x Nf where Ni is the number of image 
pixels which map back to the source plane. Since the 
lens matrix L is sparse, avoiding its construction does 
not directly increase the speed of the code, but it allows 
us to sidestep the explicit construction of the blurring 
matrix. Since PSFs may describe a significant blurring 
effect, the product of the lens and blurring matrices / 
may have a large number of non-zero entries, decreas- 
ing the sparsity of the system. To avoid building the 
blurring matrix, we convolve with the PSF in Fourier 
space, which is computationally inexpensive, even for 
large images. Convolution without a blurring matrix 
is comm on in the irnage pr ocessing community and was 
used by iNagy et all (|2002f) to solve least-squares prob- 
lems in the context of the standard image deconvolution 
problem. By direct extension, this 'matrix-free' lensing 
method all ows us to solve the leas t-squares problem de- 
scribed bv lWarren and Dv3 ()2003n without the need for 
explicit representation of the matrices involved using lo- 
cal optimization. This approach not only provides a sub- 
stantial increase in speed, but also allows for the use of 
large data sets with complicated PSFs. In principle the 
matrix-free approach can be extended to spatial ly vari- 
ant PSFs using the techniques described by iNagv et al.l 
(2002). 

The local linear optimization algorithms considered in 
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this paper require an initial guess of the solution. We 
begin with a blank source prior, and each successive it- 
eration adds increasingly higher spatial frequency detail 
to this initial guess. Local optimizers such as the conju- 
gate gradient method for least squares problems (CGLS; 
(pjjorck 1996)) and steepest descent (SD) require one ma- 
trix multiplication and one matrix transpose multiplica- 
tion per iteration, so that the least squares problem can 
be solved without explicit representation of the lens ma- 
trix. We show that such a procedure converges in prac- 
tice to a source model that is very close to the solution 
found through matrix inversion, given a sufficient num- 
ber of iterations. Moreover, an explicit regularization 
term is not required in general since iterative optimiza- 
tion techniques have been shown to have an automatic 
regularizing effect on the problem (Vogcl 2002) , allowing 
Equation ([5]) to be minimized directly. 

Iterative schemes have been used previously in the 
strong lensing literature in a pplication to noii li near r eg- 
ularization, for example by iWallington et al.l ()1996[ ) in 
the lensMEM method. A similar approach is used i n 
the LENSVIEW code by |Wayth and Webster ('2006), 
which utilizes the MEM discussed by Shilling and Bryan 
(jl984[ ). The semilinear method is restricte d to linear 
regularization terms of the type detailed in iSuvu et al.l 
|2i006). Nonlinear regularization like the MEM can 
also be used with Mirage, although these techniques re- 
quire more complicated nonlinear optimization schemes 
to solve for the source intensity distribution. 

2.3. A Small-scale Test 

As a small scale test problem, we generated a 120 x 120 
pixel image of an analytical source intensity profile. The 
image pixel size used in this test is 0.03 arcsec. The test 
source is defined by a two-armed spiral test function, 
given by Bonnet (1995}: 

S{r, uj) = — -exp[-2 sin^ (a; - wo - Tr^)] , (9) 

where So is the maximum brightness in arbitrary units 
and core radius r^. The tightness of the arms about 
the central bulge is controlled by r, and ujq controls the 
orientation of the spiral, in standard polar coordinates 
(r, w). The lensed image of this function can be compli- 
cated, since the function contains significant structure, 
and therefore provides a good test of the level of detail 
that we are able to recover using our lens inversion algo- 
rithm. A test image is generated using the approach de- 
tailed above, with each image pixel composed of a 10 x 10 
grid of subpixels. The high subpixel resolution mimics 
the smooth structure of a natural image. We blur the re- 
sulting image by convolving with a Gaussian PSF, with 
a FWHM of 5 pixels on a 30 x 30 grid, and add Gaus- 
sian noise to construct our final test image, as shown in 
Figure [TJ Our simulated data set has a signal-to-noise 
ratio S/N = 8, where we define the S/N as the maxi- 
mum image intensity divided by the standard deviation 
of the additive Gaussian noise. The lens used to produce 
this i mage is the six-parameter sing ular isothermal ellipse 
(SIE: lKeeton and Kochanel] (|1998l )') which has Cartesian 



defiection angles given by 

a. = ^^t.n-^(^-^^^] (10) 

ay — , tanh"^ I o — I , (H) 

where = + x^) + and q = ^(1 — e)/{l + e), 

and b is the corresponding Einstein radius in the limit of 
a spherical model with q = 1. The parameter h is related 
to the velocity dispersion a^, by 

\ c) Dos ^ ' 

where Dig and Dos are the angular distances between lens 
and source and observer and source respectively, and c 
is the speed of light. The actual parameters used to 
construct Figure[T]are as follows: The velocity dispersion 
is (Ty — 260 km s~^ resulting in b = 1.35 arcsec, ellipticity 
e = 0.4, lens center {x,y) — (0,0.12), orientation angle 
9l = Tr/2. We keep the core size fixed at s = 0. In 
addition to these six parameters, we assume that the 
redshift of the lens and source are Zd = 0.3 and Zs = 
1.0 respectively. For convenience we measure angular 
distances with respect to the "flat" Friedmann metric 
with k = 0. 

We model the data using the semilinear method and 
matrix free methods with subpixel grids of size 2x2. 
The sub-pixel resolution used for modeling is lower than 
that used to produce the data, which makes the test 
more realistic. The source plane is defined on a 40 x 40 
grid, with source pixels of size 0.024 arcsec. The original 
simulated-data image is shown in the first row of Figure 
[U the semilinear reconstruction on the second row, and 
a reconstruction using the CGLS algorithm on the third 
row. CGLS was chosen as the linear optimizer because of 
its speed and popularity as a local optimization scheme, 
but in practice all of the optimizers included in the Mi- 
rage package produce similar results. All of the local 
optimization algorithms tested are able to recover the de- 
tails of the original source function well. Figure shows 
that the rate of convergence varies between optimization 
algorithms, but they all settle down to the minimum re- 
duced image Xr values to within 5% by generation 40. 
The semilinear method is displayed on this plot simply 
as a constant Xr because it is a direct method. Note 
that all of the source reconstructions show noise back- 
traced from the image, which is unavoidable using pixel 
mapping techniques on data containing noise. All local 
optimization algorithms converge in approximately 2 s, 
while the semilinear method required approximately 16 
s. This test was conducted on a 2.4 GHz dual-core Intel 
machine with 3 GB of memory. Memory usage was mon- 
itored and did not exceed the hardware memory limit at 
any time. 

In general, the lensed model image becomes increas- 
ingly well matched to the data the longer an iterative 
optimizer runs, but the usefulness of the solutions even- 
tually starts to degrade as the algorithm begins fitting 
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to the noise in the data. Therefore, the corresponding 
noise level in the source reconstruction rises as iterations 
continue, which we can quantify for this test problem be- 
cause we know the true solution in the absence of noise 
as shown in Figure [3l In effect, the number of iterations 
of the local optimizer acts as a regularization parameter 
([Flemi ng 1990). Thus, it is possible to introduce regular- 
ization by carefully controlling the number of iterations 
during local optimization. In general, implicit regulariza- 
tion is present whenever these local optimizers are used 
in the context of deblurring problems, which implies that 
suitable stopping criteria must be established to find the 
optimally regularized solution ([Hansen et al.l [20061 ) . It 
is noteworthy that the semilinear method suffers from 
a related problem since the regularization constant is a 
free parameter, and therefore the associated ambiguity is 
equivalent to the problem of choosing a stopping criteria 
in iterative methods. Techniques have been developed 
to deal with this problem using Ba yesian analysis for 
the semilin ear method, as discussed bv lBrewer and Lewis! 
dlbOe) and lSuvu et"all ([2OO60 . For iterative methods, the 
issue of a stopping criteria is a non-trivial problem that 
has no unique solution for local optimization, although 
many methods exist to deal wi th this problem, su ch as 
Generalized Cross Vali dation ([Wahba et al.l I1979D and 
the L-curve criterion ([E ngl et al.l 120001) " We discuss a 
novel approach in Section [2.6l that uses the L-curve anal- 
ysis to estimate the optimal regularization parameter 
(stopping condition) in conjunction with global parame- 
ter search methods. 

2.4. Iterative Optimization as Implicit Regularization 

To see how iterative schemes produce implicit regular- 
ization, consider a system b=Bx+n, where n describes 
the noise added to the true i mage. Suppose that the 
blurring matrix B is ill-posed ([HansenI 119971 ). and the 
"true" solution is the unblurred image, represented as a 
vector X. Given the blurring matrix and the noisy data 
5, we can formally write an approximate solution to the 
inverse problem as x—B^b. However, this proves to be 
difficult in practice because of the poor conditioning of 
B and the noise contained in the data. The resulting 
solution is the sum of two terms, a;„ =x+B~^n. The 
second term can dominate the first in this expression, 
which results in poor recovery of the true solution, x. 
To overcome this problem, regularization schemes seek a 
solution to the system 

XX = argmin (||b- Sa;|p + \\\H {x - Xq) |p) (13) 

where H is the regularization matrix, B is the system 
matrix and & is a vector of data to be fit. The regular- 
ized solution is xx, and the default solution is Xq, which 
is found when A — > oo. By requiring the derivative of 
Equation ([13p to vanish we derive the following expres- 
sion 

(jB^B + \H^H^ X = XH^Hxq + B'^b (14) 

provided that the regularization is linear in nature. Note 
that the first term on the right depends on the default 
solution ccq, which represents a bias in general. For the 



remainder of this report we set the default solution to 
zero, which is reasonable since most astronomical images 
are largel y composed o f pixels corresponding to blank 
sky (Brew er" and Lew"isl [2b06). 

The solution to this system can also be found by con- 
sidering the singular value decomposition (SVD) of a ma- 
trix B= U'I!,V^ , where U and V ar e orthogonal N x N 
matrices ([Golub and ReinschI Il970[ ) . The matrix S is 
diagonal, containing the non-increasing singular values 
1^1 > 1^2 > •■• > fn- The columns of U are a set of or- 
thogonal vectors Ui, and the orthogonal columns of V 
are denoted by Vi, which leads us to the expression 

a, = J2^v,. (15) 

The small singular values (large values of i) correspond 
to the addition of high frequency noise, and the terms 
involving the smallest singular values I'i tend to dominate 
the solution. The singular values Vi and the expansion 
coefficients \ufb\ as a function of the number of terms 
are shown in Figure[4l These plots are called Picard plots 
and show that an increased contribution to the noise in 
the reconstruction is found as the singular values become 
smaller than the magnitude of the expansion coefficients. 

The goal of a regularization scheme is to limit the 
amount of noise that contributes to the solution. In 
principle, the simplest scheme is to truncate Equation 
([T^ for sufficiently large values of i to limit the amount 
of high frequency noise in the solution. Truncation of 
the SVD expansion can be accomplished by multi plying 
the t erms of Equation ([T5|) by a "filter factor" ([Vogell 
[19891) defined by 

^ u^b 

^fM = ^04^— Vj, (16) 

where takes the form of a Heaviside function such that 
4>i — I for singular values below the cutoff point fc, and 
4>i = for terms with i > k. In this way, the contribution 
of high-frequency noise to the solution can be controlled. 
However, this regularization scheme is somewhat artifi- 
cial because of the sharp cut off in th e filter fac tors. A 
more natural scheme was developed bv lTikhonovl ([1963), 
which introduces a regularization parameter A to solve 

(^B^B + XI^x = B^b. (17) 

The Tikhonov solution for x is expressed as the standard 
SVD expansion with modified filter factors 



The solution of this system corresponds to the solution 
of Equation p3[) with the regularization matrix equal to 
the identity and the prior solution Xq =0. 

When ^ A, (pi w 1. For large z, <C A such 
that (j)i « i^i/X- Note that the eigenvalues of the Ns x 
Ns system matrix B B are the squares of the singular 
values, fii — vf . The sum of the Tikhonov filter factors 
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is then 
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Mi + A ' 



(19) 



This e xpression agrees with Equation (21) in lSuvu et alj 
(j2006l ). who show that the sum 7 represents the num- 
ber of source degrees of freedom in the problem when 
Tikhonov regularization is included. 

Iterative methods effectively add consecutive terms to 
the sum in Equation (ITSt with each step of the algo- 
rithm, such that the numbe r of ite rations itself acts as 
a regularization parameter ([Hankd [l995) . In order to 
find the best solution from an iterative optimizer, it is 
necessary to stop it near the optimal iteration, before 
the contributions due to noise in the solution grow too 
large. As can be seen in Figure [3l the CGLS algorithm 
converges significantly faster than the SD method. How- 
ever, the noise also rises more quickly past convergence, 
which makes the CGLS solution more sensitive to the 
stopping condition. The SD method is generally consid- 
ered a slower and less sophisticated local optimization 
algorithm than CGLS, but performance issues are out- 
weighed by SD's m ore stable behavior past convergence. 
INagv and Palmerl (2003) first noted that optimization 
schemes based on SD do not require as precise a stopping 
criterion as other methods, which makes it easier to find 
an approximation to the optimally regularized solution. 
In the absence of the L-curve criteria, we choose SD. 
When using a stopping condition based on the L-curve, 
CGLS is recommended due to the algorithms speed in 
obtaining better solutions. 

2.5. Monte Carlo Estimate of the Effective Degrees of 

Freedom 

The iterative optimizers we have considered in this pa- 
per can be expressed in terms of the SVD expansion, 
Equation ((T6)) . with unique expressions for the filter fac- 
tors As in the case of Tikhonov regularization, we 
associate the sum of these filter factors 7 with the num- 
ber o f effective degrees of freedom in the problem (jVogell 
I2OO2D . For the case of the CGLS algorithm, t hese fil- 
ter fa ctors are recursive in the singular values (jHansenl 
[1993). This poses a problem because we use the CGLS 
scheme without explicitly building the matrices, and the 
solution of the singular values presents difficulty when 
using large data sets. Furthermore, the recursive scheme 
can become unstable (' Hansenlll994[ ). To circumvent this 
problem, we use a Monte Carlo scheme to estimate the 
sum of the filter factors. In essence, this scheme intro- 
duces a Gaussian random vector b with zero mean and 
unit standard deviation that contains the same number 
of elements as the data vector b. While iteratively solv- 
ing for the solution vector x using the conjugate gradient 
method, we simultaneously solve a second system with 
noise vector b using the same CGLS coefficients {dk and 
/3fc in standard notation) to derive a corresponding vec- 
tor X. iHanke and HansenI (|1993[) and lGirardI (|1989f ) show 

that b^ (b—Ax) provides an estimate of the number of 
degrees of freedom in the original system with data vector 
b. Note, however, that this estimate approximately dou- 



bles the computational overhead of the standard CGLS 
method. We perform this calculation during each call to 
the CGLS algorithm, allowing an estimate of the reduced 
for each set of lens parameters. 

2.6. L-curve Analysis 

The iterative optimization algorithms used in the lo- 
cal optimization step (the inner loop of our optimiza- 
tion scheme) converge to lower spatial frequencies faster 
than higher frequencies, and therefore the high-frequency 
noise present in the source reconstructions can be sup- 
pressed by controlling the number of iterations of the 
local optimizer. In general, we wish to find a balance 
between t he image and the amount of source regular- 
ization (P ress et al.l 120071 ). Since the regularizing effect 
of iterative optimizers is implicit, we need a metric to 
evaluate the amount of regularization introduced at each 
itera tion. For simplicity, we u se zeroth-order re gulariza- 
tion (|Warren and Dvd '(|2003n : iSuvu et al.l (|2006[ )) in this 
paper which sums the squares of source pixel intensities, 
in order to quantify the regularizing effects of the local 
optimizers. By calculating the image and regular- 
ization measure, each iteration of the local 
optim izer, we can form an L-curve (iHansen and O'Learvl 
'1993") for each solution. In the standard image deblurring 
problem, the point associated with the "corner" of the 
L-curve represents the solution that best balances the im- 
age fitness and the amount of regularization introduced 
in modeling the source. This solution is found by deter- 
mining the point on the trade-off curve with maximum 
curvature. We parameterize the L-curve by arclength 
{x{s),y{s)), where x and y represent the regularization 
term and image respectively, and fit a cubic spline 
curve to x and y. The derivatives of the cubic spline 
curves with respect to the arclength can be calculated 
analytically, which provide a simple method to calculate 
the curvature k. The point of maximum curvature is 
found using the curvature formula: 



y'x"\ 



(a;'2 +y'2)5 



(20) 



We show a sample of L-curves in Figure [6] and corre- 
sponding source solutions in Figure [Sj including the so- 
lution located at the point of maximum curvature. In 
general, the solution found by the L-curve analysis agrees 
with the maximum Bayesian evidence solution to approx- 
imately 10%. The solution corresponding to the point of 
maximum curvature of the L-curve is used to evaluate 
the fitness of each set of lens parameters. 

2.7. A Large-scale Test 

We form the gravitationally lensed image of a large 
source to demonstrate the efficiency of our iterative 
matrix-free approach. The source is a square image of 
Af51, of dimension 512 x 512, show n in FigureEl obtaine d 
from the NED online data archive (jKennicut et al.|[2003f) . 
The lensed image is generated using an SIE lens defined 
on a 640 x 640 grid, with Einstein radius 6=3 arcsec, 
e = 0.4, and 9l = 7r/4, with the lens centered at the 
origin. To demonstrate the behavior of the code with 
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a complicated PSF, we used a PSF composed of a ra- 
dial sine func tion multiplied by an elliptical Moffat PSF 
(|Moffatlll969( ). which is shown in the figure. The result- 
ing function provides a non-symmetric PSF that contains 
significant low-level structure. Such a large PSF would 
require a very large non-sparse blurring matrix, whose 
linear size must necessarily match the number of image 
pixels which map to the source, in this case 3.34 x 10^ 
square. After adding Gaussian white noise, the S/N of 
the blurred observation is S/N — 20. The solution shown 
in the figure was computed by the CGLS method and has 
a reduced image Xr = 0.9954 and was found in 35 itera- 
tions that took 86.4 seconds using a single 2.4 GHz Intel 
processor. 

The next section discusses the global optimizers. Fer- 
ret and Locust, which we use to solve for the lens model 
parameters. Both are parallel codes that require approx- 
imately 10'' — 10^ lens parameter sets to be evaluated 
for a thorough search, optimization, and mapping of the 
parameter space. Assuming 5 x 10''^ evaluations, the lens 
parameters could be solved for this large-scale test prob- 
lem in approximately six days on an eight-core computer. 
Such a large-scale problem would be impractical using a 
matrix inversion scheme due to the large size of the ma- 
trices that would be involved. 

3. THE FULL OPTIMIZATION PROBLEM 

Section [2] focused mainly on the linear least squares re- 
construction of the source, for a known lens mass distri- 
bution. However, the full problem must also determine 
the optimal set of lens parameters. The lens parame- 
ters are solved as an 'outer loop' optimization problem, 
which calls the semilinear method, or alternatively our it- 
erative approach, as an inner loop optimization for each 
set of nonlinear lens parameters evaluated. The inner 
loop optimizes the lensed source by executing an arbi- 
trary number of iterative steps (in our examples, 40) of 
a local optimizer like the CGLS algorithm. The L-curve 
for each lens parameter set is built, and the optimally 
regularized solution that lies nearest the corner of this 
curve is found. The value of this optimally regular- 
ized solution is used to evaluate the fitness of the corre- 
sponding set of lens parameters. During the inner loop 
optimization, a statistical estimate of the number of de- 
grees of freedom for the optimally regularized solution is 
made and used to determine the reduced image dur- 
ing the analysis at the end of the run. In this paper, the 
outer loop problem is solved by the Ferret GA and Lo- 
cust PSO from the Qubist Global Optimization Toolbox 
([Fiege 2010). However, the Mirage code is not limited 
to either of these optimizers and can make use of any 
external nonlinear optimization scheme. 

Both Ferret and Locust are able to map out "fuzzy" 
optimal sets defined by an inequality. In this case, we 
request a distribution of solutions with < Xmin + ^u, 
where Xmin the lowest image value found and Nu is 
an upper limit selected at the start of the run. The up- 
per limit Nu is chosen to be large enough so as to include 
solutions within the 99% confidence interval. The mem- 
bers of the optimal set, along with the estimates for the 
number of degrees of freedom, allow us to determine solu- 



tions within the 99%, 95% and 68% confide nce intervals 
by the standard method ([Press et al.l 120071 ). Thus, we 
can easily estimate errors for the nonlinear lens parame- 
ters, since these global optimizers determine the form of 
the surface in the neighborhood of the global mini- 
mum. 

The source intensities may contain negative values 
since bounds cannot be imposed in direct matrix inver- 
sion, and are not enforced in our iterative schemes. A fi- 
nal source refinement step, discussed in Section [321 uses 
the GA and PSO as bounded optimizers to find the op- 
timal positive definite source distribution, with the lens 
parameters held fixed at their previously optimized val- 
ues. 

3.1. Global Nonlinear Optimization 

The Qubist Global Optimization Toolbox contains five 
global optimizers in total, all of which are designed to be 
interchangeable. Ferret and Locust are the most pow- 
erful and well-tested optimizers in the package, which 
makes them well-suited for our problem. Qubist includes 
more than 50 test problems, so me of which are discussed 
in its user's guide ([Fiegell2010[ ). 

GAs and PSOs differ greatly from local optimization 
routines such as CGLS and SD, which require an initial 
guess and then search iteratively along a deterministic 
trajectory through the parameter space. Such methods 
are prone to becoming trapped in local minima. More- 
over, these methods are usually implemented to solve un- 
bounded optimization problems, which may be less use- 
ful than bounded optimization when there are physical 
constraints on the parameters, such as the positivity of 
source pixel values in the lens reconstruction problem 
(see Section [2T|) . 

GAs and PSOs search the parameter space in parallel, 
making use of the collective behavior of numerous inter- 
acting "agents" - a population of individuals for a GA or 
a swarm of particles in the case of a PSO. These optimiz- 
ers distribute agents randomly throughout the parameter 
space initially, which subsequently interact using heuris- 
tic rules that aim to search the space thoroughly, and 
encourage the improvement of the population or swarm 
as a whole. In both types of algorithm, these heuris- 
tic rules are partly deterministic and partly stochastic. 
The resulting optimization algorithms are more powerful 
and robust than purely deterministic methods and vastly 
more efficient than random search. In general, only a sin- 
gle agent must find the high-performance region in the 
vicinity of the true global solution for the algorithm to 
succeed. Once such a solution is discovered, it is rapidly 
communicated to all other individuals or particles, which 
will accumulate near the global minimum and refine it. 

3.1.1. Genetic Algorithms 

GAs are an important class of algorithms for global op- 
timization that work in analogy to biological evolution. 
Evolution is biology's optimization strategy of choice, in 
which organisms evolve and continually improve their 
own designs as they struggle to survive. GAs are nor- 
mally discussed using biological terminology, such that 
each "individual" is a trial solution, whose parameters 
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are encoded on "genes" . The set of individuals is a "pop- 
ulation" , and individuals search the parameter space in 
parallel as they evolve over multiple "generations" . A ba- 
sic GA requires three genetic op erators, which a re mu- 
tation, crossover, and selection (jGoldberd Il989f ). The 
role of mutation is to apply occasional random pertur- 
bations to individuals, which helps them to explore new 
regions of the parameter space. Crossover mixes together 
two parent solutions to produce offspring that are inter- 
mediate between the parent solutions. The role of the 
selection operator is to choose which solutions propagate 
to the next generation, based on the Darwinian notion 
of survival of the fittest. Various types of selection op- 
erators are possible, but tournament selection has the 
advantage tha t it is insensitive to the scaling of the fit- 
ness function (jGold bcrg 2002). 

Ferret is a parallel, multi-objective GA, which has been 
under constant development since 2002, and is the most 
sophisticated optimizer in the Qubist package. The cur- 
rent version is the fourth m ajor version of the code, and 
earlier versions were used bv lFiege et all (|2004f ) to model 
magne tized filamentary molecular clouds, and by iFiegd 
(j2003) to model submillimeter polarization patterns of 
magnetized molecular cloud cores. Ferret extends the 
basic GA paradigm in several important ways, as dis- 
cussed below. 

Multi-objective optimizers like Ferret emphasize the 
thorough exploration of parameter spaces and the abil- 
ity to map trade-off surfaces between multiple objective 
functions, which allows the user to understand the com- 
promises that must be made between several conflicting 
objectives. A core feature of a multi-objective GA is the 
ability to spread solutions approximately evenly over an 
extended optimal set of solutions, which Ferret accom- 
plishes us ing a niching algorithm simi lar to the one dis- 
cussed bv lFonseca and Fleming! (|1993| ). Even for single- 
objective problems, Ferret's multi-objective machinery 
is well-suited to explore and map out intervals in the 
neighborhood of the global minimum. We see in Section 
|4] that it is especially useful for degenerate cases where 
multiple disconnected islands of solutions exist within 
the parameter space. 

Ferret's most novel and powerful fe ature is its 'linkage- 
learning' algorithm ()Goldbers!jl2002[ ). which is designed 
to reduce a complex, multi-parameter problem to a nat- 
ural set of smaller sub-problems, whenever such a re- 
duction is possible. These simpler sub-problems are dis- 
covered experimentally by Ferret during the process of 
optimization, and sub-problems evolve almost indepen- 
dently during a run. Ferret regards two parameters A 
and B as linked if finite variations of A and B are dis- 
covered, which result in worsening of a solution when 
applied independently, but the same variations applied 
together result in improvement. In such cases, it is clear 
that A and B should be linked so that they are usu- 
ally traded together during crossovers, to preserve gains 
made by varying the parameters together. A novel exten- 
sion of Ferret's linkage-learning algorithm is its ability to 
search entire sets of parameters {A^} and {Bi\ for link- 
age in parallel, which is assigned probabilistically to the 
parameters within these sets. Thus, Ferret treats linkage 



as a matrix of probabilities that co-evolves with the pop- 
ulation during the search. Parameters that appear linked 
at the start of a run may not appear linked at the end, 
when most solutions may be nearly optimal. Conversely, 
new links can also arise as the code explores previously 
uncharted regions of parameter space. 

The ability to partition a complicated problem into 
natural sub-problems is crucial to the successful opti- 
mization of large problems. A difhcult 100 parameter 
problem with many local minima is often unsolvable on 
its own, but it becomes quite tractable if it can be par- 
titioned into (say) 10 sub-problems (or building blocks) 
with 10 parameters each. A particularly interesting fea- 
ture of Ferret's linkage-learning system is that the link- 
ages discovered are entirely insensitive to scale. Two 
sub-problems (building blocks) that are orders of magni- 
tude different in importance are discovered at the same 
rate, so that Ferret can solve all of the sub-problems cor- 
rectly and simultaneously, rather than one at a time in 
order of significance. This ability allows Ferret to dis- 
cover the true, globally optimal solution or solution set, 
even when applied to problems with very poorly scaled 
building blocks. 

Ferret contains an algorithm that monitors its progress 
and uses this information to automatically adapt several 
of its most important control parameters, including the 
mutation scale, size scale of crossover events, and several 
others. If these parameters are set poorly by the user, 
Ferret quickly and dynamically adapts them to improve 
the search. This algorithm provides an extra layer of 
robustness to the code, which helps Ferret to adapt as 
different regions of the fitness landscape are discovered. 

Ferret, and the other global optimizers of the Qubist 
toolbox, place considerable emphasis on visualization. 
The analysis window displayed at the end of a run con- 
tains various graphics options to tease out interesting fea- 
tures from the optimal set. These features include two 
and three-dimensional scatter plots, image plots, contour 
plots, and user-defined graphics. It is possible to 'paint' 
interesting regions of the parameter space and select dif- 
ferent two and three-dimensional projections to explore 
and visualize where the painted solutions reside in a high- 
dimensional parameter space. 

Modeling a gravitational lens system is a computation- 
ally intensive task that requires approximately 10^ — 10^ 
parameter sets to be evaluated for a single run. GAs 
are well suited to parallel computing because each indi- 
vidual in the population represents a single parameter 
set, which can be evaluated independently. Ferret is de- 
signed with built-in parallelization to take advantage of 
multi-CPU computers and inexpensive clusters. Parallel 
jobs are managed with a graphical "node manager" tool, 
and no changes are required to the implementation of the 
user's fitness function. It is notable that Ferret does not 
require MATLAB's parallel computing toolbox, or use 
any other third-party parallel computing software. 

The Appendix discusses some additional details of the 
Ferret algorithm. 

3.1.2. Particle Swarm Optimizers 
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Locust is a parallel raulti-objective PSO in the Qubist 
toolbox. PSOs are biologically inspired global optimiz- 
ers, which search the parameter space using a swarm of 
interacting particles. PSOs are often discussed in terms 
of the dynamics of flocks of birds, schools of fish, or 
swarms of social insects searching for food. The com- 
monality is that intelligent search behavior emerges as 
property of the system as a whole, even if the compo- 
nent parts are modeled as relatively simple automata 
that interact with e ach other through simple rules. 
iKennedv &: Eberhartl pOOl) provides a good introduc- 
tion to the PSO technique. 

PSOs are similar to GAs in that they sample many 
points in the search space simultaneously, with a swarm 
of particles moving through the parameter space follow- 
ing simple dynamical equations. Each particle in a sim- 
ple PSO is simultaneously attracted to its own "personal 
best" solution, which is the best solution that the parti- 
cle has personally discovered, and the "global best" so- 
lution, which is the best solution that the entire swarm 
has ever encountered. The law of attraction follows a 
simple spring law: F oc |Ax|, where |Ax| is the distance 
between a given particle and either the personal best so- 
lution Xp or the global best Xg . Assuming that the force 
and velocity are approximately constant over a time step, 
the new velocity and position of particle i after a time 
step At are given by 

V,{t + At) = Vi{t) (1 - At/tdamp) + 

[Cp^p(Xp - Xi) -I- Cg^g{Xg " X^)] At 

x,(< + At)=x,(t) +v,(t)At, (21) 

where Cp and Cg play the role of spring constants for 
the personal and global best solutions respectively. The 
equations include a damping term to decrease the veloc- 
ity magnitude in approximately time tdamp, which helps 
the swarm settle down as it zeros in on the optimal re- 
gion. Damping also serves to prevent runaway growth in 
so-called 'particle explosions', which can occur as a result 
of accumulated errors in Equation (|2T|) . Some random- 
ness is added via the uniform random variables and 
^g, which are typically drawn from the range 0-1. The 
stochastic terms play a role similar to the mutation oper- 
ator in a GA; they add randomness to the search, which 
helps the particles to explore previously unexplored parts 
of the parameter space. The roles of the personal and 
global best solutions are clear. The personal best solu- 
tion represents a particle's memory of the best region of 
parameter space that it has seen, and the global best 
solution represents the entire swarm's collective mem- 
ory. In effect, the global best solution allows indirect 
communication between particles to encourage collective 
behavior. 

Particle swarm optimization is a young and rapidly 
changing field of research that still has many ope n ques- 
tions, which are discussed in a recent review W iPoli et al.l 
(^007). Equation (|2ip is perhaps the simplest set of 
swarm equations, but many alternative implementations 
are possible, which strive to balance thorough explo- 
ration of the parameter space against the need to exploit 
high performance regions when they are found. 

Equation (j2ip is equivalent to a simple Euler integra- 



tion scheme for a dynamical system of equations that 
move each particle every time step. However, Locust uses 
an exact solution to the swarm equations, which is eas- 
ily obtained by solving Equation (|2ip analytically, in the 
limit At 0. Numerical experiments with Locust, and 
alternate schemes that use Euler integration, show that 
the exact sol ution results in more stable and reliable PSO 
(|Fiegell2010[ ). It is possible that the exact solution elim- 
inates the build-up of errors in the orbits, which would 
result from applying Equation (|21|) directly with a finite 
At. The exact solution is slightly more costly to evaluate 
than the Euler approximation, but this extra computa- 
tional expense is insignificant for any realistic problem, 
where the computational time is normally dominated by 
the evaluation of the fitness function. 

Determining Xp is straightforward because it repre- 
sents the personal best solution (often denoted pbest) 
that any particle has encountered. Thus, each particle 
simply keeps track of the position where it encounters 
the lowest value of the fitness function F(x), following 
Ferret's convention that lower values of F correspond to 
more desirable solutions. 

The most common particle swarm implementation is 
the simple PSO discussed above, where the global best 
solution Xg (gbest) is evaluated over the entire swarm. 
This swarm topology can be thought of as a fully con- 
nected graph, where each particle in the swarm com- 
municates with every other particle via the gbest solu- 
tion. Other swarm topologies are possible, where the 
network of communication between swarm members is 
less densely connected, so that each particle only com- 
municates with a few other particles in its neighborhood. 
In this case, the gbest solution is replaced by a set of local 
best, or Ibest solutions, such that each West solution is as- 
signed to a subset of the swarm. This scenario is referred 
to as a static Ibest topology when the network connect- 
ing particles do not change throughout the run. Dynamic 
Ibest topologies are also possible, where the network co- 
evolves with the swarm as the run progresses. Swarms 
based on sparsely connected networks can be thought of 
as being divided into sub-swarms, where each sub-swarm 
shares a common Ibest solution. Such a topology is better 
able to avoid local minima because the sub-swarms ex- 
plore the space in parallel. On the other hand, the fully 
connected gbest topology is best for exploiting a single 
isolated solution late in a run, because it focuses the ef- 
forts of the entire swarm on the region of parameter space 
in the vicinity of the gbest solution. 

Locust requires some non-standard techniques de- 
signed to thoroughly explore parameter spaces contain- 
ing sets of solutions that are equally good. Extended 
solution sets are also possible when a fuzzy tolerance is 
specified for a single objective problem, which often rep- 
resents the error tolerance of a data-modeling prob- 
lems. Locust emphasizes the mapping of spatially ex- 
tended solution sets, and therefore it makes sense to de- 
fine particle neighborhoods dynamically, based on their 
spatial location within the swarm. The code keeps track 
of the Euclidean distances between all particles, and as- 
signs neighborhoods based on the nearest Ibest particle. 
Moreover, Locust implements a novel algorithm that al- 
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lows neighborhoods, and hence sub-swarms, to merge 
and divide as required to map out the structure of the 
optimal set. This dynamic swarm topology is quite differ- 
ent from other topologies discussed in the literature, and 
has the benefit that it essentially self-optimizes. A large 
number of neighborhoods will generally be preserved to 
map a spatially extended solution set, but the swarm 
topology will correctly collapse to a single neighborhood 
late in a run if only a single solution exists, thus reduc- 
ing the algorithm to a simple gbest approach. In prac- 
tice, this technique represents a good balance between 
exploration of the parameter space and exploitation of 
the optimal set; the parallel action of many sub-swarms 
evade local minima early in the run for all problems, and 
many are retained to the end when the focus is on map- 
ping an extended solution set, but swarms reduce to the 
maximally exploiting gbest algorithm late in the run for 
problems where only a single best solution exists. 

Locust uses the same visualization system as Ferret. 
It uses a simpler setup file than Ferret, but it can read 
Ferret's setup files and translate them. Moreover, the 
formats for the initialization, fitness, and custom graph- 
ics functions arc identical. This makes it easy to swap 
optimizers for comparison purposes. The Appendix dis- 
cusses some additional details about Locust. 

3.2. Source Refinement Routine 

We use a two-step process to solve the full inversion 
problem. In the first step, we determine the nonlin- 
ear lens parameters as described in Section [3l In the 
second refinement step, we hold the best set of lens 
parameters constant and allow the global optimizer to 
fit the source brightness distribution. We treat each 
source plane pixel as a free parameter and judge the fit- 
ness of solutions based on the image Xr statistic. This 
ty pe of pixelized sou rce fitting using a GA was outlined 
by iBrewer and LewisI (|2005ID . The Qubist global opti- 
mization routines are bounded, so positivity conditions 
on the source reconstruction are easily enforced in this 
step. Since the intensity of each source pixel is indepen- 
dent, this approach does not produce a regularizing effect 
and the number of degrees of freedom in the problem is 
well defined, allowing direct comparisons of lens models. 
Therefore, this two-step method allows an estimation of 
the errors on both the lens and source intensity parame- 
ters. 

The bounds used in the refinement step can signifi- 
cantly speed up this optimization. Figure |8] shows a se- 
quence of solutions with a lower bound of and an upper 
bound equal to 1.1 times the maximum pixel intensity 
in the source. These bounds ensure that the source is 
strictly positive but can significantly slow the optimiza- 
tion due to the large volume of the parameter space that 
is searched. Both of the global optimizers used in this 
report can include a user-defined solution in the first gen- 
eration. Therefore, a more practical optimization strat- 
egy is to consider the absolute value of the optimally 
regularized solution found by the iterative optimization 
process, and define a "window" of acceptable pixel inten- 
sity values for each source pixel. In our tests, a tolerance 
of ±25% of the pixel intensities is usually sufficient to 



bracket the true intensities. Pixels with negative inten- 
sities in the optimally regularized solution should always 
have a lower bound of to prevent artifacts in the source 
solution. The upper bound of these pixels is taken to 
be the absolute value of the pixel intensity plus 15%. 
Practically, this reduces the volume parameter space to 
be searched and generally allows a solution to be found 
more quickly. 

4. DEMONSTRATIONS OF THE FULL OPTIMIZATION 
PROBLEM 

In this section, we show results from several illuminat- 
ing test problems that solve the full lens reconstruction 
problem and characterize the behavior, performance, and 
limitations of the global optimizers. 

4.1. Trivial Solutions and the Problem of 
Dimensionality 

Consider a lens model based on a singular isothermal 
sphere, which provides a simple analytical model with 
a circularly symmetric deflection angle given by Equa- 
tion ()12|) in the radial direction. This deflection angle is 
used to form the synthetic data with velocity dispersion 
(Ty = 500 km s~^, centered at the origin {x,y) = (0,0). 
We construct artificial data where the Einstein ring has 
radius b = 1 arcsec, assuming source redshift Zd — 0.2, 
defiector redshift Zs — 1.5. For convenience, we again 
measure angular distances with respect to the Friedmann 
metric with k — 0. The lensed image is defined on a 
120 X 120 rectangular mesh with an image pixel size of 
0.015 arcsec. A 3 x 3 subsampling per pixel is used to con- 
struct the lensed image. The source is perfectly aligned 
with the lens center and forms a full Einstein ring due to 
the symmetry of the mass distribution. We have blurred 
the image using a Gaussian PSF with an FWHM of 2.35 
image pixels defined on a 33 x 33 grid. The test source 
is also a Gaussian model on a 50 x 50 square mesh from 
—3 to 3 arcsec in both directions. 

In the following discussion, we hold the x and y coordi- 
nates of the lens center constant, using the actual values 
from the artificial data, and plot Xr ^ function of b 
in Figure [HI As the size of the Einstein radius (velocity 
dispersion) is varied, the corresponding Xr statistic be- 
comes double peaked, with the true solution between the 
peaks. The area to the left of the peaks, the region of low 
b, contains trivial solutions that map the source almost 
straight through the lens, reproducing the image almost 
exactly with minimal distortion. In fact, the 6 = source 
does not include any gravitational lens effect at all, thus 
reducing the problem to a conventional image deconvolu- 
tion exercise. Note that this trivial solution results in re- 
duced Xr = 0.973, even though the reconstructed source 
is physically unrealistic. The surface varies smoothly 
as we approach the 'true' value of b with Xr — 0-986, 
and increases with b beyond this value. When b is large, 
we again begin to see a decrease in x^, to the asymptotic 
value of = 1.15, as the structure of the source becomes 
increasingly complex to compensate for the distortion in- 
troduced by the lens. Typically such high b solutions give 
rise to spurious images in which some pixels lie outside 
the boundaries of the image plane. We wish to avoid 
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the very low and very high b solutions, since they do not 
correspond to physical solutions of the problem. 

With the lens center fixed, the above example is a sim- 
ple one parameter problem, which can be easily solved 
by a global optimization routine designed to map a range 
of values near the global minimum. However, anal- 
ogous examples may exist in more complicated systems 
with more parameters, where the parameter space can 
become dominated by trivial solutions. The problem be- 
comes especially difficult when false solutions, such as 
the trivial ones in Figure [SI occupy a region of space 
whose dimensionality is greater than the true solutions. 
In such cases, GAs and PSOs can fail when the number 
of search agents is too small for the problem, since the 
entire population or swarm may be drawn into the re- 
gion of trivial solutions and never discover the class of 
true solutions that occupy a region of lower dimensional- 
ity. Even if the high-performance region containing the 
true solution is discovered, both Ferret and Locust are 
designed to spread solutions evenly over the optimal re- 
gion, which contains the trivial solutions if the goal is 
to map the solution set within Ax^ of the minimum. 
Thus, the population or swarm may become diluted by 
spreading out over the trivial region, which has higher 
dimensionality. The right panel of Figure [S] shows the 
results of keeping the Einstein radius b fixed at its true 
value and varying the lens center {x,y). The true so- 
lution point, with — 0.986, is surrounded by a ring 
of poor solutions, which signifies multiply imaged solu- 
tions. In this projection, trivial solutions occupy a two- 
dimensional plane at large radius and have Xr = 0.978. 
In order to overcome the complication of trivial solutions, 
an estimate of the range of acceptable parameter values 
is made. By imposing such parameter restrictions the 
algorithm is guaranteed to find a non-trivial solution to 
the optimization problem. Notably, Ferret also imple- 
ments a novel algorithm that promotes the speciation of 
the population into isolated clusters, which may help to 
overcome this difficulty. 

4.2. A Realistic Test 

For a more realistic and complicated test, consider the 
SIE lens model presented in Section 12.31 We use the 
same parameters to solve a test system, with the source 
intensity profile as given by Equation ([9]). We fix the 
redshift of the deflector and source as in the previous 
example, and model the parameters of the lens density 
model using both Ferret and Locust. The fitness objec- 
tive to be minimized is the standard x^ • The parameters 
of the best solutions are summarized in Table 1. Both 
algorithms automatically map the region of parameter 
space near the minimum by heavily populating this re- 
gion of parameter space. The effective number of degrees 
of freedom for each model is estimated and saved during 
the course of the run. By using these quantities we are 
able to estimate confidence intervals and the errors of the 
lens parameters. The structure of the global x^ surface 
is calculated at the end of the run using the members 
of the optimal set, saved from each generation (Ferret) 
or time step (Locust). Figure fTOl shows that Ferret more 
thoroughly explores the parameter space than the Locust 



algorithm. 

We find that the GA and PSO converge approximately 
at the same rate. Figure [11] compares the performance 
of the algorithms by plotting the fitness of the best solu- 
tion as a function of the number of function evaluations, 
while Figure ITOl shows the distribution of solutions in the 
parameter space. Figure [TUj shows that Ferret correctly 
discovers a pair of equally good degenerate solutions sym- 
metric in orientation angle, but Locust picks out only one 
of these groups, which refiects Ferret's greater emphasis 
on mapping the parameter space. Baran (2009) used 
these same optimizers to estimate the system tempera- 
ture of the DRAG synthesis array and noted that Locust 
found solutions significantly faster on average. We do 
not find the same behavior of the PSG in this problem. 

4.3. Source Refinement 

Once we have determined the lens parameters, we hold 
them constant and begin the final source refinement step 
of the optimization, which involves 2500 parameters for 
the case shown. The source refinement results in an op- 
timal non-negative source intensity distribution, as dis- 
cussed in Section 13.21 The image is of size 120 x 120, 
while the source plane is defined as a 50 x 50 grid. 
The solutions at the beginning of this step appear to 
be comprised purely of noise, but an approximation to 
the true solution becomes increasingly well defined as 
the run progresses, and the image residuals gradually 
become featureless. Figure [Sj shows an evolutionary se- 
quence of the lowest Xr solution, where the final solution 
has Xr — 1.05. The search is a bounded linear prob- 
lem, which is mathematically simpler than the nonlinear 
search for lens parameters. However, the large number 
of parameters complicates the optimization and the GA 
converges in a few thousand generations. Source refine- 
ment is the most computationally expensive part of this 
problem, requiring approximately five days on an eight 
core machine. The most useful aspect of this intensive 
search is to estimate errors on the source plane pixels 
determined by the best fit lens model. 

Ferret's convergence on the source refinement problem 
is shown in Figure [T2] The smooth convergence curve is a 
hallmark of linear or other easy problems. We have noted 
that the source reconstructions begin fitting to noise in 
the target image slowly, so it is generally quite easy to 
find an acceptable termination criteria for the algorithm. 
Since each pixel in the source is independently treated 
by the GA, this problem cannot be expressed in terms of 
the SVD expansion in the same way that the solutions 
to a linear optimization step can. However, to quan- 
titatively ensure that overfitting to noise is prevented, 
we once again form the L-curve between the image x^ 
and a linear regularization measure ^ sf to quantify the 
amount of noise in the source. In practice the L-curve 
analysis in the final analysis step is of limited use due to 
the smooth convergence of the algorithm and the slow 
rise in noise in the reconstructed sources. Ferret is able 
to converge to a source near the location of the true so- 
lution for all situations that we have tested. The best 
solution typically agrees with the true source to within 
15% though we have noticed variation in the details of 
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the derived sources from run to run, which is expected 
considering the large number of parameters involved in 
the optimization. It is interesting, and perhaps surpris- 
ing, that the Locust PSO is unable to solve this problem, 
despite its linearity. We conclude that a GA is a more 
robust and efficient approach than particle swarm opti- 
mization for both of these optimization problems. When 
the problem is small, a PSO can often find the solution 
in a comparative amount of time as a GA. 



5. CONCLUSIONS 

The semilinear method provides an elegant way to de- 
scribe gravitational lens inversion in terms of a least 
squares problem, but is limited to relatively small im- 
ages and a narrow PSF. This is due to the fact that the 
semilinear method requires the inversion of a large matrix 
whose size increases as the fourth power of the number of 
source pixels, and the sparsity of this matrix is reduced 
as larger PSFs are used. Solving for lens parameters is 
a nonlinear optimization problem, which can be solved 
by global optimization techniques. We applied and com- 
pared the Ferret GA and Locust PSO to determine the 
nonlinear parameters of the lens model. The global opti- 
mization of lens parameters requires a lens inversion for 
each set of lens parameters tested, and 10* — lO'^ such 
evaluations are required for a thorough exploration of 
the parameter space and mapping of the optimal region. 
This reinforces the need for fast lens inversion techniques 
that scale well with the size of the image and PSF. 

We addressed the need for a fast lens inversion al- 
gorithm by developing a matrix-free approach to solve 
the least squares lensing problem, based partly on re- 
cent developments in the image deblurring literature, 
which solves the problem without the need to explic- 
itly build the lens or blurring operators. This approach 
is intended to complement the semilinear method when 
speed is of the essence, or when large images and broad, 
highly structured PSFs are used. We note that our ap- 
proach can be extended to the case of a spatially variant 
PSF. Our analysis evaluated the convergence behavior 
of a matrix-free method using several local optimization 
methods. We found that the CGLS method is fastest to 
converge, but all linear optimization schemes suffer from 
over-fitting of noise if the optimization is not stopped at 
the critical iteration, which cannot be predicted a priori. 
We showed that steepest descent methods are more ro- 
bust against over-fitting to noise at the expense of the 
speed of convergence. 

The number of degrees of freedom in the iterative 
optimization step is estimated using a Monte Carlo 
method, allowing us to draw connections to the work of 
iSuvu et all (|2006[ ) that estimate the number of degrees of 
freedom using Bayesian statistics. We derived a formula 
for the number of degrees of freedom based on the filter 
factors of the Tikhonov regularizati on problem, which 
agrees with the expression found by iSuvu et aD ()2006l ) 
using Bayesian analysis. 



We developed a novel method that computes the opti- 
mally regularized solution for each set of lens parameters 
by finding the point of maximum curvature in the trade- 
off curve between and a measure of the amount of 
regularization in the solution, which we took to be the 
sum of the squares of source pixel intensities. The am- 
biguity of choosing a regularization parameter or stop- 
ping criteria is removed, because we automatically de- 
termine the optimal number of iterations (regularization 
constant) using the L-curve. We evaluate the fitness of 
lens parameter sets using the image statistic. 

The convergence and parameter space mapping prop- 
erties of the Ferret GA and the Locust PSO schemes were 
compared, and we determined that the GA explores the 
parameter space more thoroughly than the PSO. The 
GA obtained a more detailed optimal set of solutions, 
highlighting the degeneracy in the position angle of a 
Singular Isothermal Elliptical lens model due to the ro- 
tational symmetry of the lens. Both methods converge 
at a similar rate. 

As a final refinement step in the image reconstruction 
our approach uses the GA or PSO to directly solve for 
pixel intensities. This addition has the important benefit 
that the non-negativity of the source intensity profile can 
be enforced. It is notable that the Ferret GA was able 
to solve this bounded linear solution refinement problem, 
but the Locust PSO failed due to the high dimensionality 
of the search (^2500 parameters). This analysis step 
shows stable convergence, and noise is introduced to the 
source very slowly. In practice this routine is relatively 
insensitive to stopping criteria. 

This paper serves as a foundation for future ex- 
plorations, which will apply the techniques discussed 
here to data, and expand them to include non- 
parametric lens model s , suc h as those discussed by 
iVegetti and KoopmansI (|2009D and iSaha et al.l (|2007D . 
Non-parametric lens density models are extremely valu- 
able, since dark matter haloes m ay contain signifi- 
cant substructure (jKoopmansI l2005f ) that is not taken 
into account by analytical lens models. GAs have 
been applied to t h is pro blem previously, sp ecifically by 
i Liesenborgs et al.l (|2007D . usin g the work of iDiego et al.l 
(2005) as a starting point. ILiesenborgs et al.l ()2009f ) 
used such an approach to model the system SDSS 
J1004 -I- 4112. This approach could be used in conjunc- 
tion with the semilinear method to model complicated 
non-parametric lens density distributions and reveal the 
details of lensed extended sources. 
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APPENDIX 
FERRET AND LOCUST DETAILS 

This Appendix offers a few additional details about the Ferret and Locust optimizers used in this pa per and how 
their strategy parameters were set. A more complete description is given in the Qubist User's Guide (jFiegd [20To() , 
which is available from J. Fiege. 

Ferret Genetic Algorithm 

Most GAs encode model parameters on binary strings ()Hollandlll975l : lGoldber3ll989() , with mutations and crossovers 
defined as operators that work directly on these strings. For example, a mutation would typically fiip a single bit, 
while a simple crossover would cut two binary strings at the same position and exchange the parts of the string to the 
right of the cut, effectively mixing together two individuals in the population. If these strings represent real valued 
parameters of a model, it is necessary to decode the binary representation into real numbers prior to evaluation. 
Ferret is specialized to work directly with genotypes specified by a list of real-valued parameters, thus side-stepping 
the conversion from binary strings to real numbers. An individual in Ferret is therefore represented by a point in 
an iV-dimensional real vector space, where N is the number of parameters or "genes" , which allows more elaborate 
mutation and crossover operators than can be defined on a simple binary string. 

Ferret contains many options, which are controlled by "strategy parameters" that are encoded in a MATLAB 
structure called par. The strategy parameters are defined by a setup file, which is read at the start of a run. Ferret 
contains a default setup file, which fills in any strategy parameters not specified by the user. These default values 
are often adequate and the software is not usually very sensitive to the exact choice of strategy parameters. This 
robustness is achieved in part by an adaptive algorithm that automatically controls several of the most important 
control parameters, affective mutations and crossovers. 

A Ferret run evolves par . general . NPop populations, where the size of each population is set by 
par. general. popSize. Ferret uses a single population by default, and it is recommended to set the population 
size in the range of 100 — 500. Generally, this choice is guided by the computational expense of evaluating the fitness 
function, the complexity of the problem, and the user's experience solving it. Larger populations tend to explore the 
parameter space more thoroughly than smaller populations, but at greater cost. When par . general . NPop > 1, the 
populations interact weakly with each other by exchanging individuals with probability par . immigration . PImmigrate 
PS 0.01 each generation. This is beneficial for some very difficult problems because multiple populations explore the 
parameter space almost independently, thus increasing the probability of finding the global solution. Ferret often per- 
forms better on very difficult problems when the total number of individuals is divided into several populations rather 
than placing them all into a single population. However, we used a single population with par . general . popSize=200. 

Ferret's mutation operator is defined as a perturbation in an A'^-dimensional real vector space, where the magnitude 
of the perturbation is drawn from an initially Gaussian distribution, whose standard deviation is determined by a 
strategy parameter par.mutation.scale=0.25 by default. The distribution of mutation scales is under adaptive 
control, and evolves during each run, as Ferret preferentially selects values that result in improved fitness. Ferret's 
default mutation rate is given by par. mutation. PMutate=0. 05. 

The role of crossover in a GA is to mix together two different solutions to produce offspring that are intermediate 
between the parents. Ferret contains two different crossover operators, which mix genes in fundamentally different 
ways. Ferret's "X-type" crossover operator is a geometry-based operator that can be shown to be analogous to the bit 
string operator found in traditional binary encoded GAs. X-type crossover is essentially an averaging operation, which 
draws a line between the parameter space coordinates of two individuals and selects a point between the individuals 
on that line. The fractional distance traveled along this line is drawn from a distribution, which was initialized to 
a Gaussian random distribution of standard deviation par . XOver . strength=0 . 25 at the beginning of the run. The 
distribution of crossover strengths is under adaptive control and co-evolves with the population to prefer crossover 
strengths that tend to result in improved fitness. Note that it is possible to occasionally overshoot during a crossover 
by drawing a crossover strength greater than one. Surprisingly, this turns out to be beneficial on many problems 
because it helps to expand the population into long, slender valleys by occasionally overshooting the end points of the 
distribution. X-type crossover is Ferret's primary search mechanism, so we normally set par . XOver . PXOver=l to set 
the crossover probability to 100%. 

Ferret's "building block crossover" operator is at the heart of its linkage-learning system and has no analogy in 
traditional GAs. This type of crossover exchanges a building block, or a group of parameters previously identified as 
linked, in their entirely from one individual to another. Building block crossover efficiently propagates building blocks 
responsible for high-quality solutions throughout the population and mixes them with other high-performing building 
blocks comprised of other parameters. We normally set par . XOverBB . PXOver=l, which indicates a 100% chance of 
mixing building blocks. 

Ferret makes a duplicate copy of all populations prior to mutation and crossover, effectively doubling the number of 
individuals. Ferret's selection operator is applied after the mutation and crossover operators, using a binary tournament 
scheme in which individuals are drawn randomly from the populations modified by mutation and crossover to compete 
against individuals drawn from the unmodified duplicate populations. Individuals that win a tournament are allowed 
to propagate to the next generation and the losers are destroyed. The probability of competition is normally 100%, 
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but it is possible to reduce the selection pressure by setting par .selection, pressure < 1. This delays convergence, 
thereby allowing more time for exploration, by causing Ferret to ignore fitness values during tournament selection with 
probability equal to 1-par. selection. pressure. 

Sometimes a second round of competition is required when individuals tie in a tournament. This occurs commonly 
in multi-objective problems, when par. selection. pressure < 1, or when a fuzzy tolerance has been defined for a 
single-objective problem. For example, we map out some region of the parameter space within A^^ {dchi2) of the 
minimum value by setting par. selection. FAbsTol=dchi2 to tell Ferret to ignore diffe rences in fitness less than thi s 
amount. In this case, Ferret employs a niching strategy similar to that discussed bv iFonseca and Fleming! (|1993f ). 
which prefers solutions with fewer near neighbors over solutions with a greater number of neighbors. The logic behind 
this preference is simple: solutions in a less populated region of parameter space are more unique, and therefore more 
valuable to the exploration of the space. 

Locust Particle Swarm Optimizer 

Locust is a relatively simple code to configure, compared to the myriad of options allowed by Ferret. 

The most important strategy parameter controlling a PSO is the number of particles in the swarm, given by 
par. swarm. N. In general, larger swarms tend to explore the parameter space more thoroughly, but may require more 
time to do so. Very small swarms are problematic because they may sample the space poorly and miss the global 
solution. There is no established rule for choosing the swarm size. One typically starts with about 100 particles and 
decreases the number of particles if experience shows that this decreases the run time without causing problems with 
reliability. Very difficult problems may require more than 100 particles, and we used par. swarm. N=200. 

par. swarm, eg and par. swarm, cp are, respectively, the global best and personal best constants used in Equation 
pTj) . Both of these parameters should be of order unity, but setting eg slightly less than cp is usually helpful because 
this places more emphasis on exploration of the parameter space because the particles are infiuenced less by the global 
best solution. Increasing eg relative to ep places more emphasis on the exploitation of the global solution or solutions, 
at the expense of exploration, because all particles will be drawn to the optimal region more rapidly. We used the 
default values: par.swarm.cg=0.5 and par . swarm. cp=l. 

par. swarm, dt is the time step between updates to the swarm positions and velocities. Therefore, the time step 
dt affects the rate of sampling of the parameter space as particles move around on their orbits, but has no effect 
on the accuracy of the obits because Locust uses an exact solution to the simple harmonic oscillator orbit equations 
approximated by the finite difference equation given by ([21]) . We used the default value par. swarm. dt=l. 

PSOs require damping to cause the particle swarm to settle down to a converged solution. Locust is designed such 
that par. swarm. TDamp=l corresponds to a critically damped harmonic oscillator. Generally, underdamped oscillations 
are required so that multiple orbits explore the parameter space before the swarm converges. We used the default 
value for the damping time par . swarm. TDcmip=10. 
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Best Fit Solution 


Ferret GA 


Locust PSO 


True 


Lower Limit 


Upper Limit 


Xr 


1.010 


1.012 


0.998 








260.002 


259.892 


260.000 


250.000 


280.000 


€ 


0.401 


0.399 


0.400 


0.200 


0.500 


X 


-2.101 X 10"'' 


-2.123 X 10-"* 


0.000 


-0.500 


0.500 


y 


0.119 


0.120 


0.120 


-0.500 


0.500 




4.713 


4.713 


7r/2 


0.000 


27r 



TABLE 1 

Optimal lens parameters found by the Qubist optimizers for the example given in the text using the SIE lens. We restrict 

THE RANGE OF THE LENS PARAMETERS TO PREVENT CONVERCJENCE TO A TRIVIAL SOLUTION. PERFORMANCE OF THE GA AND PSO 

routines is similar as can be seen FROM THE REDUCED IMAGE ■ 
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Fig. 1. — Top row, from left: artificial data, source intensity distribution and Gaussian PSF used to generate the observation. The source 
was built on a 50 X 50 grid, and the lensed image is defined on a 120 X 120 grid. Middle row, from left: model observation of resulting 
source intensity reconstruction, source intensity profile as found by the semilinear method, and the resulting image residuals. Zeroth 
order regularization with a regularization constant A = 2.5 X 10"'^ was used to reconstruct the source. Bottom row, from left: resulting 
model image, model source and image residuals as determined by the CGLS algorithm after 40 iterations. Note the similarity between the 
semilinear and iterative solutions with respect to the derived source. Although both of these models contain back-traced noise, the real 
features of the source are reproduced and clearly visible in the reconstructions. 
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Image convergence vs. number of iterations 

1 1 1 I I I 

Semi-Linear 




5 10 15 20 25 30 35 40 

Number of iterations 

Fig. 2. — Convergence properties of several local optimization routines. The CGLS and LSQR IIBiorcklll996l) algorithms exhibit identical 
behavior, and the performance of the GMRES (Saad and Schultz 1986) algorithm is similar. The steepest descent algorithm converges 
more slowly but attains a slightly lowrer image value. The difference between the local optimization routines and the semilinear method 
is emphasized on this plot due to the logarithmic scale. The semilinear method result was obtained using a zeroth order regularization 
constant A = 2.5 X 10""^, and iterative algorithms were terminated after 40 iterations. 
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Q I I I I I I I I 

5 10 15 20 25 30 35 40 

Number of iterations 

Fig. 3. — Convergence properties of the source intensity distribution. This figure plots the relative error between a given iterative method 
and the true source intensity distribution. All of the iterative optimization algorithms display semi-convergence behavior. The semilinear 
method result was obtained using a zeroth-order regularization constant A = 2.5 X 10~^, and iterative algorithms were terminated after 40 
iterations. The relative error of the solution found by the SD algorithm increases more slowly past convergence than the error for any of 
the other iterative schemes shown here. 
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Fig. 4. — Picard plot for Gaussian PSFs with full width at half-maximum of 0.94, 1.64, and 2.12 pixels, respectively. The points on the 
black curve are the singular values Vi and the small dots are the expansion coefficients \ujb\. As the PSF becomes increasingly large, the 
singular values drop below the expansion coefficients more quickly. This drop-off signifies an increased contribution of high frequency noise 
in the solution as can be seen in the model solutions shown. 
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L-Curves 

5.8 1 1 1 1 — 




4.6 ' ' ' ' ' ' ' 

6 6.2 6.4 6.6 6.8 7 

ln(||s||) 

Fig. 5. — Variety of L-curves for the sys tem shown in Figure[T] The curve marked with the square, circle and triangle is the true solution, 
with the parameters described in Section 12.31 Each successive curve has the same parameters as the true solution except for the velocity 
dispersion, which takes the values 260, 262.5, 265, 267.5 and 270 km s~^, respectively. The location of the optimally regularized solution 
balances the residual and solution norms, denoted by the corner of the curve and marked by a circle. The optimally regularized solution 
for the true set of lens parameters has reduced = 0.998, = 658.4, found after 7 CGLS iterations. 
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Fig. 6. — Sources corresponding to the solutions marked in Figure [5] Left: oversmoothed solution (square), 3 CGLS steps, reduced 
= 1.228, \\si\\ = 613.2. The middle panel shows the optimally regularized solution in Figure [5] (circle) after 7 CGLS steps, reduced 
Xr = 0.998, = 658.4. The panel on the right shows the solution (triangle) after 18 iterations, reduced Xr — 0.965, \\si\\ = 740.4. 
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Fig. 7. — Large scale test of Mirage. Top left: original image of M51 used to generate a large-scale test problem. Top right: model 
source obtained with the CGLS algorithm after 40 iterations. The original image was obtained from NED, originally 700 X 700, cropped 
to 512 X 512 pixels. The effect of the lens mapping can be seen as the source plane is not completely covered by the back-traced image. 
Middle left: lensed image of M51 as seen through an SIE lens used as artificial data. The image is comprised of 640 X 640 pixels, over an 
area of 25.5 arcscc^. Middle right: model image of M51 as produced by the CGLS algorithm after 40 iterations. Bottom left: PSF used to 
blur the observation, shown in logarithmic intensity to highlight the low-lcvcl structure. The function is a 65 X 65 pixel PSF. Bottom right: 
residuals obtained from comparing original and model images. The residuals are featureless and have a maximum lO^'' of the original 
image maximum. The reconstruction has a reduced image = 0.9954. 
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Fig. 8. — The lowest Xr solution at 100, 250, 500, 750, 1000, 1250, 5000 generations. Left column: model image. Middle column: source 
brightness distribution. Note the presence of reconstructed noise. Right column: image residuals. At 5000 generations, a model image with 
Xr = 1-05 wa-s found. Each image is independently scaled to highlight image features. 
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Fig. 9. — Left: fitness as a function of Einstein radius for a symmetric lensed image produced by a background Gaussian source intensity 
distribution. The Einstein radius was varied manually with the lens center fixed at the origin. The dashed line indicates the lower limit 
used to model the system, and the dash-dotted line is the upper limit used to restrict the value of the Einstein radius. The region between 
these two limits is the region in which the true solution is located. Middle: Einstein radius limits are shown superimposed on the artificial 
data. Right: fitness as a function of lens center. The lens center position was varied over a 64 X 64 grid and the lens normalization fixed 
at the true value, 6=1 arcsec. Trivial solutions populate the corners of the image. The true solution lies at the center of the image. 
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Flc;. 10. — Parameter space plots of the SIE example in Section 4. The optimal set of solutions determined by global optimization marked 
by points shaded according to position within the 99%, 95% and 68% confidence intervals, represented by light gray, medium gray and 
black respectively. The location of the true solution is marked with a white cross. Top row: the Ferret GA optimal set. Bottom row: the 
Locust PSO optimal set. Left column: ellipticity e vs. velocity dispersion ctv, middle column: lens centre coordinates y vs. x, and right 
column: orientation angle 6]^ vs. ct„. The PSO does not explore the structure of the parameter space as thoroughly as the GA. Note the 
rightmost column in which the orientation angle degeneracy of the system is detected by the Ferret GA but no corresponding solution 
group is present in the Locust PSO optimal set of solutions. 
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Fig. 11. — Average convergence history of the Ferret GA (solid line) and the Locust PSO (dashed line) as a function of the number of 
function evaluations. The convergence of the GA is more stable, as the PSO tends to converge in a series of steps as the parameter space is 
explored. The figure is plotted over 1.0 x 10^ function evaluations. We have averaged over four runs of the PSO and four runs of the GA. 
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Fig. 12. — Convergence history of the hnear parameters during source refinement stage using the Ferret genetic algorithm. 



